Lecture’s plan

  1. Convolutional Neural Network (CNN)
  2. Encoder-Decoder
  3. Transformers

Convolutional Neural Network

Convolutional Neural Network (CNN)

  • Convolutional Neural Networks, or Convolutional Networks, or CNNs, or ConvNets
  • For processing data with a grid-like or array topology
    • 1-D grid: time-series data, sensor signal data
    • 2-D grid: image data
    • 3-D grid: video data
  • CNNs include four key ideas related to natural signals:
    • Local connections
    • Shared weights
    • Pooling
    • Use of many layers

CNN architecture

  • Intuition: Neural network with specialized connectivity structure
    • Stacking multiple layers of feature extractors: low-level layers extract local features, while high-level layers learn global patterns.
  • There are a few distinct types of layers:
    • Convolutional Layer: detecting local features through filters (discrete convolution)
    • Non-linear Layer: applying a non-linear activation function, typically the Rectified Linear Unit (ReLU)
    • Pooling Layer: merging similar features

Building-blocks for CNNs

(1) Convolutional layer

  • The core layer of CNNs
  • Convolutional layer consists of a set of filters, \(W_{kl}\)
  • Each filter covers a spatially small portion of the input data, \(Z_{i,j}\)
  • Each filter is convolved across the dimensions of the input data, producing a multidimensional feature map.
  • As we convolve the filter, we are computing the dot product between the parameters of the filter and the input.
  • Deep Learning algorithm: during training, the network corrects its errors and the filters are learned, e.g., in Keras, by adjusting the weights with Stochastic Gradient Descent, SGD (a stochastic approximation of gradient descent that uses a randomly selected subset of the data at each step).
  • The key architectural characteristics of the convolutional layer are local connectivity and shared weights.

Convolutional layer: Local connectivity

  • Neurons in layer m are connected only to 3 adjacent neurons in layer m−1.
  • Neurons in layer m+1 have a similar connectivity with the layer below.
  • Each neuron is unresponsive to variations outside of its receptive field with respect to the input.
    • Receptive field: small neuron collections which process portions of the input data.
  • The architecture thus ensures that the learnt feature extractors produce the strongest response to a spatially local input pattern.
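The local connectivity above can be illustrated with a small NumPy sketch, where each output neuron sees only a receptive field of 3 adjacent inputs (the input values and filter weights are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input signal
w = np.array([0.5, 1.0, 0.5])             # filter of width 3

# "Valid" convolution: slide the filter over every window of 3 inputs.
# Each output depends only on 3 neighbouring inputs (its receptive field).
out = np.array([np.dot(w, x[i:i+3]) for i in range(len(x) - 2)])
print(out)  # → [4. 6. 8.]
```

Note that each output is the dot product between the filter parameters and one local window of the input, exactly as described above.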

Convolutional layer: Shared weights

  • We show 3 hidden neurons belonging to the same feature map (the layer right above the input layer).
  • Weights of the same color are shared—constrained to be identical.
  • Replicating neurons in this way allows for features to be detected regardless of their position in the input.
  • Additionally, weight sharing increases learning efficiency by greatly reducing the number of free parameters being learnt.
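A quick back-of-the-envelope comparison (with made-up layer sizes) shows how drastically weight sharing reduces the number of free parameters:

```python
# Hypothetical sizes, chosen only for illustration.
inputs, outputs, filter_width = 100, 98, 3

fully_connected = inputs * outputs   # every neuron has its own weights
convolutional = filter_width         # one shared filter of width 3

print(fully_connected, convolutional)  # → 9800 3
```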

Convolution without padding

Convolution with padding
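The effect of padding can be checked with the standard output-size formula \(o=\lfloor (n-k+2p)/s \rfloor + 1\), for input size \(n\), filter size \(k\), padding \(p\), and stride \(s\):

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length of a convolution over an input of size n
    with filter size k, padding p, and stride s."""
    return (n - k + 2 * p) // s + 1

# Without padding the output shrinks; padding of 1 with a 3-wide
# filter preserves the input size ("same" padding).
print(conv_output_size(32, 3, p=0))  # → 30
print(conv_output_size(32, 3, p=1))  # → 32
```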

(2) Non-linear layer

  • Intuition: Increase the nonlinearity of the entire architecture without affecting the receptive fields of the convolution layer
  • A layer of neurons that applies a non-linear activation function, such as:
    • \(f(x)=\max(0,x)\) - Rectified Linear Unit (ReLU); fast and the most widely used in CNNs
    • \(f(x)=\tanh x\)
    • \(f(x)=|\tanh x|\)
    • \(f(x)=(1+e^{-x})^{-1}\) - sigmoid
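These activation functions can be written directly in NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)       # clips negative inputs to 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# tanh is available directly as np.tanh
x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # → [0. 0. 2.]
print(np.tanh(x))
print(sigmoid(x))
```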

(3) Pooling layer

  • Intuition: to progressively reduce the spatial size of the representation, thereby reducing the number of parameters and the amount of computation in the network, and hence also controlling overfitting
  • Pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value of the features in that region.
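A minimal NumPy sketch of max pooling over non-overlapping 2×2 regions, as described above:

```python
import numpy as np

def max_pool_2x2(x):
    """Partition x into non-overlapping 2x2 blocks and keep each block's max."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 1, 2, 3],
              [1, 1, 4, 4]])
print(max_pool_2x2(x))  # → [[4 8] [9 4]]
```

The 4×4 input is reduced to 2×2: each output is the maximum of one 2×2 sub-region.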

Pooling (down sampling)

Other layers

  • The convolution, non-linear, and pooling layers are typically used as a set. Multiple sets of the above three layers can appear in a CNN design.
    • Input → Conv. → Non-linear → Pooling → Conv. → Non-linear → Pooling → … → Output
  • Recent CNN architectures have 10-20 such layers.
  • After a few sets, the output is typically sent to one or two fully connected layers.
    • A fully connected layer is an ordinary neural network layer as in other neural networks.
    • Typical activation function is the sigmoid function.
    • Output is typically class (classification) or real number (regression).

Other layers

  • The final layer of a CNN is determined by the research task.
  • Classification: Softmax Layer \[P(y=j|\boldsymbol{x}) = \frac{e^{\boldsymbol{w}_j \cdot \boldsymbol{x}}}{\sum_{k=1}^K{e^{\boldsymbol{w}_k \cdot \boldsymbol{x}}}}\]
    • The outputs are the probabilities of belonging to each class.
  • Regression: Linear Layer \[f(\boldsymbol{x}) = \boldsymbol{w} \cdot \boldsymbol{x}\]
    • The output is a real number.
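A minimal NumPy sketch of the softmax computation; the class scores \(\boldsymbol{w}_j \cdot \boldsymbol{x}\) are made-up values for illustration:

```python
import numpy as np

def softmax(scores):
    """Turn class scores w_j . x into class probabilities."""
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])     # hypothetical w_j . x values
probs = softmax(scores)
print(probs)        # probabilities of belonging to each class
print(probs.sum())  # → 1.0
```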

CNN for Text

Main CNN idea for text:

Compute vectors for n-grams and group them afterwards



Example: for “this takes too long”, compute vectors for:

this takes, takes too, too long, this takes too, takes too long, this takes too long
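The n-gram windows above can be generated with a short helper (the `ngrams` function name is our own):

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of the sentence."""
    return [" ".join(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = "this takes too long".split()
for n in (2, 3, 4):
    print(ngrams(tokens, n))
# → ['this takes', 'takes too', 'too long']
# → ['this takes too', 'takes too long']
# → ['this takes too long']
```

A CNN for text computes a vector for each such window with a filter of width n, then groups (e.g., max-pools) them.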

CNN for text classification

CNN with multiple filters

Build a CNN in Keras

  • The Sequential model is used to build a linear stack of layers.
  • The following code shows how a typical CNN is built in Keras.
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD

Note:

  • Dense is the fully connected layer;
  • Flatten is used after all CNN layers and before the fully connected layer;
  • Conv2D is the 2D convolution layer;
  • MaxPooling2D is the 2D max pooling layer;
  • SGD is the stochastic gradient descent algorithm.
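The model-building code itself is not reproduced above, so the following is only a sketch of how such a CNN might be assembled with these imports; the input shape (28×28 grayscale) and all layer sizes are hypothetical choices for illustration, not the ones from the original slide:

```python
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from keras.optimizers import SGD

# Two (Conv -> Pool) sets, then Flatten and fully connected layers,
# following the Input -> Conv -> Pooling -> ... -> Output pattern above.
model = Sequential([
    keras.Input(shape=(28, 28, 1)),            # hypothetical input shape
    Conv2D(32, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dense(10, activation="softmax"),           # e.g. 10-class classification
])
model.compile(optimizer=SGD(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training would then proceed with `model.fit(...)` on labelled data.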

Encoder-Decoder

Encoder-Decoder

  • RNN: input sequence is transformed into output sequence in a one-to-one fashion.
  • Goal: Develop an architecture capable of generating contextually appropriate output sequences of arbitrary length
  • Applications:
    • Machine translation,
    • Summarization,
    • Question answering,
    • Dialogue modeling.

Simple recurrent neural network illustrated as a feed-forward network

 

Most significant change: a new set of weights, U, that connect the hidden layer from the previous time step to the current hidden layer. These weights determine how the network should make use of past context in calculating the output for the current input.

Simple-RNN abstraction

(Simple) Encoder-decoder networks

 

  • Encoder generates a contextualized representation of the input (last state).
  • Decoder takes that state and autoregressively generates a sequence of outputs.

General encoder-decoder networks

Abstracting away from these choices

  1. Encoder: accepts an input sequence, \(x_{1:n}\) and generates a corresponding sequence of contextualized representations, \(h_{1:n}\)
  2. Context vector \(c\): function of \(h_{1:n}\) and conveys the essence of the input to the decoder.
  3. Decoder: accepts \(c\) as input and generates an arbitrary length sequence of hidden states \(h_{1:m}\) from which a corresponding sequence of output states \(y_{1:m}\) can be obtained.
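The three components can be sketched with a simple, untrained RNN in NumPy; all sizes and the random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                          # hidden/input size (made up)
W_enc, U_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_dec, U_dec = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def encode(xs):
    """(1) Encoder: x_{1:n} -> contextualized representations h_{1:n}."""
    h, hs = np.zeros(d), []
    for x in xs:
        h = np.tanh(W_enc @ x + U_enc @ h)
        hs.append(h)
    return hs

def decode(c, m):
    """(3) Decoder: context c -> m hidden states (outputs are read off these)."""
    h, hs = np.zeros(d), []
    for _ in range(m):
        h = np.tanh(W_dec @ c + U_dec @ h)  # c is made available at each step
        hs.append(h)
    return hs

xs = [rng.normal(size=d) for _ in range(5)]    # input sequence x_{1:5}
hs = encode(xs)
c = hs[-1]            # (2) simplest context vector: the last encoder state
ys = decode(c, m=3)   # arbitrary output length, here m = 3
print(len(ys))        # → 3
```

Note that the output length m is chosen independently of the input length n, which is the point of the abstraction.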

Popular architectural choices: Encoder

Decoder basic design

  • Produce an output sequence one element at a time

Decoder design enhancement

Decoder: How output y is chosen

  • Sample from the softmax distribution (OK for generating novel output, but not OK for, e.g., machine translation or summarization)
  • Take the most likely output (greedy decoding; doesn't guarantee that the individual choices make sense together)
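The two strategies can be contrasted on a single softmax distribution, here over a made-up 3-token vocabulary:

```python
import numpy as np

probs = np.array([0.6, 0.3, 0.1])   # hypothetical softmax output

greedy = int(np.argmax(probs))      # most likely output: always token 0
rng = np.random.default_rng(0)
sampled = int(rng.choice(len(probs), p=probs))  # sampling may pick any token

print(greedy)   # → 0
```

Greedy decoding is deterministic but makes each choice in isolation; sampling introduces variety, which suits open-ended generation better than translation or summarization.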

Transformers

Transformers

(Attention is all you need!)

Summary

Summary

  • Convolutional Neural Network (CNN)
  • Encoder-Decoder
  • Transformers

Time for Practical 7!